
pkg/stanza: add backpropagation of number of processed entries #16452

Closed

Conversation

@sumo-drosiek (Member) commented Nov 23, 2022

Description:

This is a fix for #15378.

The idea is to backpropagate information about how many entries have been processed by the pipeline.
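
A minimal sketch of the idea (illustration only: the entry, stage, and Process names below are hypothetical stand-ins, not the actual pkg/stanza operator interface):

```go
package main

import "fmt"

// entry stands in for a pkg/stanza entry; the real type lives in
// pkg/stanza/entry and is richer than this placeholder.
type entry struct{ body string }

// stage is a hypothetical operator interface whose Process reports how
// many entries were actually accepted by everything downstream of it.
type stage interface {
	Process(entries []entry) (processed int, err error)
}

// filterStage drops empty entries and backpropagates the count returned
// by the next stage in the pipeline.
type filterStage struct{ next stage }

func (f filterStage) Process(entries []entry) (int, error) {
	kept := make([]entry, 0, len(entries))
	for _, e := range entries {
		if e.body != "" {
			kept = append(kept, e)
		}
	}
	// The count reported upstream is whatever the rest of the pipeline
	// actually processed, not what this stage forwarded.
	return f.next.Process(kept)
}

// sinkStage is the terminal stage; it processes everything it receives.
type sinkStage struct{}

func (sinkStage) Process(entries []entry) (int, error) { return len(entries), nil }

func main() {
	p := filterStage{next: sinkStage{}}
	n, err := p.Process([]entry{{body: "a"}, {body: ""}, {body: "b"}})
	fmt.Println(n, err) // 2 <nil>: two entries made it through the whole pipeline
}
```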

Link to tracking Issue:
#15378

Testing:
Manual
Updated unit tests

Documentation:
In code comments

@github-actions github-actions bot added the pkg/stanza and processor/logstransform (Logs Transform processor) labels on Nov 23, 2022
@sumo-drosiek sumo-drosiek marked this pull request as ready for review November 24, 2022 14:50
@sumo-drosiek sumo-drosiek requested a review from a team November 24, 2022 14:50
@sumo-drosiek sumo-drosiek changed the title from "pkg/stanza: attempt to fix logstransformprocessor flow" to "pkg/stanza: add backpropagation of number of processed entries" on Nov 25, 2022
@djaglowski (Member) left a comment

This looks good to me in general.

The unit tests are great, but given the overall purpose of this change, I think we should also have a few test cases that validate that the return values are propagated back through an entire pipeline.

@sumo-drosiek (Member, Author)

@djaglowski Do you have an idea how to do this technically? I'm not really sure what the test scenario should look like.

Do you want me to create a receiver that consists of several operators, and then validate the return value of the last one?

@djaglowski (Member)

You can build and start a pipeline at the pkg/stanza level that is basically just a slice of operators. (Default operator not required)
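
Such a test could look roughly like the sketch below, assuming a Process method that returns the processed count as described above; the operator type and buildPipeline helper are hypothetical stand-ins rather than the real pkg/stanza API:

```go
package pipeline

import "testing"

// fakeEntry, countingOp, and buildPipeline are hypothetical stand-ins for
// pkg/stanza types; the point is only the shape of the assertion.
type fakeEntry struct{ body string }

type countingOp struct{ next *countingOp }

func (o *countingOp) Process(entries []fakeEntry) (int, error) {
	if o.next == nil {
		return len(entries), nil // terminal operator processes everything
	}
	return o.next.Process(entries) // propagate the downstream count back up
}

// buildPipeline chains n operators so the first one's return value reflects
// the whole pipeline.
func buildPipeline(n int) *countingOp {
	var head *countingOp
	for i := 0; i < n; i++ {
		head = &countingOp{next: head}
	}
	return head
}

func TestProcessedCountPropagatesThroughPipeline(t *testing.T) {
	p := buildPipeline(3) // three chained operators
	entries := []fakeEntry{{"a"}, {"b"}, {"c"}, {"d"}}

	got, err := p.Process(entries)
	if err != nil {
		t.Fatal(err)
	}
	if got != len(entries) {
		t.Fatalf("expected %d processed entries back from the first operator, got %d", len(entries), got)
	}
}
```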

@github-actions (Contributor)

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@runforesight (bot) commented Dec 27, 2022

Foresight Summary

Major Impacts

build-and-test-windows duration (4 seconds) has decreased by 45 minutes 13 seconds compared to the main branch average (45 minutes 17 seconds).

⭕ build-and-test-windows finished in 4 seconds (45 minutes 13 seconds less than the main branch average).
✅ tracegen finished in 1 minute 5 seconds (2 minutes 54 seconds less than the main branch average).
✅ check-links finished in 1 minute 37 seconds (1 minute 59 seconds less than the main branch average).
✅ changelog finished in 2 minutes 12 seconds (5 minutes 51 seconds less than the main branch average).
✅ prometheus-compliance-tests finished in 3 minutes 36 seconds (6 minutes 38 seconds less than the main branch average).
✅ build-and-test finished in 35 minutes 41 seconds (22 minutes 39 seconds less than the main branch average).
✅ load-tests finished in 8 minutes 21 seconds (9 minutes 11 seconds less than the main branch average).

All workflows finished on 29th Dec, 2022 with no failing jobs, steps, or tests; per-job details are available on Foresight.

@djaglowski djaglowski removed the Stale label Dec 27, 2022
@sumo-drosiek (Member, Author)

@djaglowski Is this change still wanted, given #17079?

@djaglowski (Member)

My understanding is that the issue is resolved by #17079.

The ConsumeLogs function is now only blocked when waiting to write to a channel, whereas we formerly were blocking with an expectation of being able to read from a channel. I would certainly appreciate your take on this as well, if you have time. Either way, thanks for all the effort you put into this PR.
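
To illustrate the distinction with generic Go (not the actual processor code): a caller that only blocks on a channel send returns as soon as there is buffer space, without waiting for the consumer to finish its work:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	out := make(chan string, 1) // buffered: senders block only when the buffer is full

	// Consumer drains the channel asynchronously.
	go func() {
		for msg := range out {
			time.Sleep(50 * time.Millisecond) // simulate slow downstream work
			fmt.Println("consumed:", msg)
		}
	}()

	// "ConsumeLogs-like" caller: blocks only while waiting to write.
	// It does not wait to read a result back from the consumer.
	out <- "batch-1"
	fmt.Println("caller returned after the send, without waiting for processing")

	time.Sleep(200 * time.Millisecond) // give the consumer time to print in this demo
	close(out)
}
```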

@sumo-drosiek (Member, Author)

@djaglowski I rebased the PR. Backpropagation is added, but it's not used in logstransformprocessor for now.

Looking at the #17079 change, I'm not sure if asynchronous processors are aligned with the pipeline architecture. In its current state, it will take all logs and claim that they were processed correctly:

[screenshot omitted]

I can revert this to the desired state of my commit, but I'm not sure what the decision is here.

@djaglowski (Member)

I'm not sure if asynchronous processors are aligned with the pipeline architecture.

I may be wrong, but I don't think processors are necessarily expected to be synchronous. For example, batchprocessor is one of only two in the core collector and it is asynchronous.

In its current state, it will take all logs and claim that they were processed correctly.

Can you elaborate on this? I'm not familiar with how this is communicated and what impact it has.

@sumo-drosiek (Member, Author) commented Dec 30, 2022

In its current state, it will take all logs and claim that they were processed correctly.

Can you elaborate on this? I'm not familiar with how this is communicated and what impact it has.

I'm not sure if asynchronous processors are aligned with the pipeline architecture.

I may be wrong, but I don't think processors are necessarily expected to be synchronous. For example, batchprocessor is one of only two in the core collector and it is asynchronous.

There is an error backpropagation mechanism:

https://github.com/open-telemetry/opentelemetry-collector/blob/e4e76dec22ecf8cf1d5b70af6d74215f0e05f0c0/processor/processorhelper/logs.go#L65-L71

I'm not really sure who handles the errors and how, but it reminds me that there is an issue with batchprocessor not backpropagating errors:

When the exporter cannot handle more data and has to drop it, it returns an error to batchprocessor, but the error is not backpropagated to the receiver, so we cannot handle it (stop receiving data, ask the client to stop sending for some time). (@swiatekm-sumo correct me if I'm wrong)


Anyway, I think the PR is ready, and whether logstransformprocessor should be synchronous or not should be a matter for another PR.

@djaglowski (Member)

In order to justify the additional complexity, we need to have a clear design for how this would be used. It's no longer clear to me what these changes would accomplish. What will we do with the returned integer, specifically?

@swiatekm (Contributor) commented Jan 11, 2023

When the exporter cannot handle more data and has to drop it, it returns an error to batchprocessor, but the error is not backpropagated to the receiver, so we cannot handle it (stop receiving data, ask the client to stop sending for some time). (@swiatekm-sumo correct me if I'm wrong)

Yeah, this is correct, and tracked in open-telemetry/opentelemetry-collector#4646.

In general, if we have an asynchronous component, that component essentially acts as a buffer and needs to take care of its own retry logic. Being asynchronous, it can't return errors to the caller, but it can and should retry on errors it gets from the rest of the pipeline. Otherwise it just drops data without providing any feedback and stops conducting backpressure.
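
A rough sketch of that responsibility (generic Go, not batchprocessor's actual implementation): an asynchronous stage that cannot return errors to its caller owns the retries itself instead of dropping data:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// consumeFunc stands in for the downstream ConsumeLogs call.
type consumeFunc func(batch []string) error

// asyncBuffer accepts data without returning downstream errors to the caller,
// so it must own retries; otherwise failed batches would be silently dropped.
type asyncBuffer struct {
	ch      chan []string
	consume consumeFunc
}

func newAsyncBuffer(consume consumeFunc) *asyncBuffer {
	b := &asyncBuffer{ch: make(chan []string, 8), consume: consume}
	go b.loop()
	return b
}

// Add blocks only when the internal buffer is full (this is the backpressure).
func (b *asyncBuffer) Add(batch []string) { b.ch <- batch }

func (b *asyncBuffer) loop() {
	for batch := range b.ch {
		// Retry with a simple backoff until the downstream accepts the batch.
		for attempt := 1; ; attempt++ {
			if err := b.consume(batch); err == nil {
				break
			}
			time.Sleep(time.Duration(attempt*10) * time.Millisecond)
		}
	}
}

func main() {
	calls := 0
	flaky := func(batch []string) error {
		calls++
		if calls < 3 {
			return errors.New("downstream temporarily unavailable")
		}
		fmt.Println("delivered after", calls, "attempts:", batch)
		return nil
	}

	b := newAsyncBuffer(flaky)
	b.Add([]string{"log-1", "log-2"})
	time.Sleep(200 * time.Millisecond) // allow the retry loop to finish in this demo
}
```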

@github-actions (Contributor)

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 27, 2023
@sumo-drosiek (Member, Author)

@djaglowski Any input here? I don't feel this change makes sense in other scenarios (maybe a metric to indicate how much data has been dropped in the queue).

@djaglowski (Member)

In general, if we have an asynchronous component, that component essentially acts as a buffer and needs to take care of its own retry logic. Being asynchronous, it can't return errors to the caller, but it can and should retry on errors it gets from the rest of the pipeline. Otherwise it just drops data without providing any feedback and stops conducting backpressure.

Thanks for explaining this, it makes sense to me now why this processor would need special considerations for handling backpressure.

Do we have the same problem in the pkg/stanza receivers? They all share the same conversion/batching strategy. This design has been questioned in the past, mostly for its complexity. I wonder if the correct solution here is to make the shared converter synchronous.

@github-actions github-actions bot removed the Stale label Jan 31, 2023
@sumo-drosiek (Member, Author)

Do we have the same problem in the pkg/stanza receivers? They all share the same conversion/batching strategy. This design has been questioned in the past, mostly for its complexity. I wonder if the correct solution here is to make the shared converter synchronous.

It seems that errors are only logged by the receiver. There is no attempt to handle recoverable errors or to inform the client that data has been dropped.

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/adapter/receiver.go#L134-L139
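
The pattern being described looks roughly like the following, condensed into a standalone sketch (an illustrative paraphrase with hypothetical names, not the actual receiver.go source):

```go
package main

import (
	"errors"
	"log"
)

// consumeLogs stands in for the next consumer in the pipeline; here it always
// fails so the logging-only behaviour is visible.
func consumeLogs(batch []string) error { return errors.New("downstream rejected the batch") }

func main() {
	batches := [][]string{{"entry-1"}, {"entry-2"}}
	for _, batch := range batches {
		if err := consumeLogs(batch); err != nil {
			// The error is logged and then dropped: nothing is retried and the
			// client that produced the data is never informed.
			log.Printf("ConsumeLogs() failed: %v", err)
		}
	}
}
```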

@github-actions (Contributor)

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 26, 2023
@sumo-drosiek (Member, Author)

It seems that errors are only logged by the receiver. There is no attempt to handle recoverable errors or to inform the client that data has been dropped.

@djaglowski Should I create a separate issue for this?

@djaglowski (Member)

@sumo-drosiek, I think a separate issue makes sense.

@github-actions github-actions bot removed the Stale label Jun 1, 2023
@github-actions (Contributor)

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions (Contributor)

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jun 30, 2023